108 research outputs found
Application-Specific Number Representation
Reconfigurable devices, such as Field Programmable Gate Arrays (FPGAs), enable application-specific
number representations. Well-known number formats include fixed-point, floating-point,
logarithmic number system (LNS), and residue number system (RNS). Such different
number representations lead to different arithmetic designs and error behaviours, thus producing
implementations with different performance, accuracy, and cost.
To investigate the design options in number representations, the first part of this thesis presents
a platform that enables automated exploration of the number representation design space. The
second part of the thesis shows case studies that optimise the designs for area, latency or
throughput from the perspective of number representations.
Automated design space exploration in the first part addresses the following two major issues:
• Automation requires arithmetic unit generation. This thesis provides optimised
arithmetic library generators for logarithmic and residue arithmetic units, which support
a wide range of bit widths and achieve significant improvement over previous designs.
• Generation of arithmetic units requires specifying the bit widths for each
variable. This thesis describes an automatic bit-width optimisation tool called R-Tool,
which combines dynamic and static analysis methods, and supports different number
systems (fixed-point, floating-point, and LNS numbers).
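The idea of choosing a bit width from observed error can be illustrated with a minimal sketch. It is not R-Tool: it covers only a simulation-based (dynamic) search over fixed-point fractional bits, it quantises only the input of the function, and the names `quantize` and `min_frac_bits` are hypothetical.

```python
def quantize(x, frac_bits):
    """Round x to the nearest multiple of 2**-frac_bits (a fixed-point grid)."""
    scale = 1 << frac_bits
    return round(x * scale) / scale


def min_frac_bits(func, samples, tolerance, max_bits=32):
    """Smallest number of fractional bits whose worst-case absolute error
    on the given samples stays within tolerance (None if none is found).

    Assumption: only the input is quantised; a real tool would also model
    the precision of every intermediate variable.
    """
    for bits in range(max_bits + 1):
        worst = max(abs(func(quantize(x, bits)) - func(x)) for x in samples)
        if worst <= tolerance:
            return bits
    return None


if __name__ == "__main__":
    samples = [i / 100 for i in range(101)]
    print(min_frac_bits(lambda x: x * x, samples, 1e-3))
```

A search of this kind only guarantees the error bound on the inputs it was shown, which is one reason to pair it with static analysis as the abstract describes.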
Putting it all together, the second part explores the effects of application-specific number
representation on practical benchmarks, such as radiative Monte Carlo simulation, and seismic
imaging computations. Experimental results show that customising the number representations
brings benefits to hardware implementations: by selecting a more appropriate number format,
we can reduce the area cost by up to 73.5% and improve the throughput by 14.2% to 34.1%; by
performing the bit-width optimisation, we can further reduce the area cost by 9.7% to 17.3%.
On the performance side, hardware implementations with customised number formats achieve
a speedup of 5 to potentially over 40 times over software implementations.
Large-Scale Automatic K-Means Clustering for Heterogeneous Many-Core Supercomputer
Funding: UK EPSRC grants “Discovery” EP/P020631/1, “ABC: Adaptive Brokerage for the Cloud” EP/R010528/1. This article presents an automatic k-means clustering solution targeting the Sunway TaihuLight supercomputer. We first introduce a multilevel parallel partition approach that not only partitions by dataflow and centroid, but also by dimension, which unlocks the potential of the hierarchical parallelism in the heterogeneous many-core processor and the system architecture of the supercomputer. The parallel design is able to process large-scale clustering problems with up to 196,608 dimensions and over 160,000 targeting centroids, while maintaining high performance and high scalability. Furthermore, we propose an automatic hyper-parameter determination process for k-means clustering, by automatically generating and executing the clustering tasks with a set of candidate hyper-parameters, and then determining the optimal hyper-parameter using a proposed evaluation method. The proposed auto-clustering solution can not only achieve high performance and scalability for problems with massive high-dimensional data, but also support clustering without sufficient prior knowledge of the number of targeted clusters, which can potentially extend the scope of the k-means algorithm to new application areas. Postprint. Peer reviewed.
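The candidate-sweep process described above can be sketched as follows. This is a small single-machine illustration only: the abstract does not specify the evaluation method, so the inertia-drop criterion in `pick_k`, the threshold `min_gain`, and the farthest-point initialisation are all stand-in assumptions, and every function name is hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain Lloyd iterations with deterministic farthest-point initialisation.
    Returns (centroids, inertia)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            groups[nearest].append(p)
        # Keep the old centroid when a group ends up empty.
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
    inertia = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, inertia


def pick_k(points, candidates, min_gain=0.3):
    """Run k-means for every candidate k, then return the smallest k after
    which adding a cluster cuts the inertia by less than min_gain
    (stand-in criterion, not the paper's evaluation method)."""
    results = [(k, kmeans(points, k)[1]) for k in sorted(candidates)]
    for (k, cost), (_, next_cost) in zip(results, results[1:]):
        if cost == 0 or (cost - next_cost) / cost < min_gain:
            return k
    return results[-1][0]


if __name__ == "__main__":
    square = [(0, 0), (0, 1), (1, 0), (1, 1)]
    points = [(x + dx, y + dy) for dx, dy in [(0, 0), (10, 10), (20, 0)]
              for x, y in square]
    print(pick_k(points, [1, 2, 3, 4, 5]))
```

Each candidate run is independent of the others, so the sweep parallelises naturally across tasks.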
Giant thermal transport tuning at a metal/ferroelectric interface
Interfacial thermal transport plays a prominent role in the thermal management of nanoscale objects and is of fundamental importance for basic research and nanodevices. At metal/insulator interfaces, a configuration commonly found in electronic devices, heat transport strongly depends upon the effective energy transfer from thermalized electrons in the metal to the phonons in the insulator. However, the mechanism of interfacial electron–phonon coupling and thermal transport at metal/insulator interfaces is not well understood. Here, the observation of a substantial enhancement of the interfacial thermal resistance and the important role of surface charges at the metal/ferroelectric interface in an Al/BiFeO3 membrane are reported. By applying uniaxial strain, the interfacial thermal resistance can be varied substantially (up to an order of magnitude), which is attributed to the renormalized interfacial electron–phonon coupling caused by the charge redistribution at the interface due to the polarization rotation. These results imply that surface charges at a metal/insulator interface can substantially enhance the interfacial electron–phonon-mediated thermal coupling, providing a new route to optimize the thermal transport performance in next-generation nanodevices, power electronics, and thermal logic devices. Peer reviewed. Postprint (author's final draft).
Validating quantum-supremacy experiments with exact and fast tensor network contraction
The quantum circuits that declare quantum supremacy, such as Google Sycamore
[Nature 574, 505 (2019)], raise a paradox in building reliable result
references. While simulation on traditional computers seems the sole way to
provide reliable verification, the required run time is doomed by an
exponentially increasing compute complexity. To find a way to validate current
"quantum-supremacy" circuits with more than qubits, we propose a
simulation method that exploits the "classical advantage" (the inherent
"store-and-compute" operation mode of von Neumann machines) of current
supercomputers, and computes uncorrelated amplitudes of a random quantum
circuit with an optimal reuse of the intermediate results and a minimal memory
overhead throughout the process. Such a reuse strategy reduces the original
linear scaling of the total compute cost against the number of amplitudes to a
sublinear pattern, with greater reduction for more amplitudes. Based on a
well-optimized implementation of this method on a new-generation Sunway
supercomputer, we directly verify Sycamore by computing three million exact
amplitudes for the experimentally generated bitstrings, obtaining an XEB
fidelity of which closely matches the estimated value of .
Our computation scales up to cores with a sustained
single-precision performance of Pflops, which is accomplished within
days. Our method has a far-reaching impact in solving quantum many-body
problems, statistical problems as well as combinatorial optimization problems
where one often needs to contract many tensor networks which share a
significant portion of tensors in common. Comment: 7 pages, 4 figures, comments are welcome
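For readers unfamiliar with the XEB fidelity mentioned in the abstract, the following minimal sketch shows how such a value can be estimated from computed amplitudes. It assumes the linear cross-entropy estimator F = 2^n · mean(|amplitude|²) − 1; the abstract does not state which estimator variant was used, and the function name is hypothetical.

```python
def linear_xeb(num_qubits, amplitudes):
    """Linear cross-entropy benchmarking estimate.

    amplitudes: ideal (simulated) complex amplitudes of the bitstrings
    that the experiment actually produced.
    Assumption: F = 2**n * mean(|a_i|**2) - 1.
    """
    probs = [abs(a) ** 2 for a in amplitudes]
    mean_prob = sum(probs) / len(probs)
    return (2 ** num_qubits) * mean_prob - 1.0


if __name__ == "__main__":
    # Samples that look uniformly random carry no signal: estimate near 0.
    uniform_amp = (1 / 8) ** 0.5
    print(linear_xeb(3, [uniform_amp] * 4))
```

Under this estimator, bitstrings whose ideal probability equals the uniform value 2^-n give zero, and samples that favour high-probability bitstrings give a positive value.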
Large-scale hierarchical k-means for heterogeneous many-core supercomputers
Funding: J. Thomson and T. Yu are supported by the EPSRC grants “Discovery” EP/P020631/1, “ABC: Adaptive Brokerage for the Cloud” EP/R010528/1, and EU Horizon 2020 grant Team-Play: “Time, Energy and security Analysis for Multi/Many-core heterogenous PLAtforms” (ICT-779882, https://teamplay-h2020.eu). This paper presents a novel design and implementation of the k-means clustering algorithm targeting the Sunway TaihuLight supercomputer. We introduce a multi-level parallel partition approach that not only partitions by dataflow and centroid, but also by dimension. Our multi-level (nkd) approach unlocks the potential of the hierarchical parallelism in the SW26010 heterogeneous many-core processor and the system architecture of the supercomputer. Our design is able to process large-scale clustering problems with up to 196,608 dimensions and over 160,000 targeting centroids, while maintaining high performance and high scalability, significantly improving the capability of k-means over previous approaches. The evaluation shows that our implementation takes less than 18 seconds per iteration for a large-scale clustering case with 196,608 data dimensions and 2,000 centroids by applying 4,096 nodes (1,064,496 cores) in parallel, making k-means a more feasible solution for complex scenarios. Postprint.
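Partitioning by dimension, as described above, rests on the fact that a squared Euclidean distance is a sum over dimensions, so each slice of dimensions can be processed separately and the partial results added. The sketch below shows only that decomposition in sequential Python; it is not the paper's SW26010 implementation, and the function names and the `num_parts` parameter are hypothetical.

```python
def partial_sq_dists(point, centroids, lo, hi):
    """Partial squared distances to every centroid using only dimensions
    lo..hi-1 (the share of work one hypothetical worker would own)."""
    return [sum((point[d] - c[d]) ** 2 for d in range(lo, hi)) for c in centroids]


def assign_by_dimension(point, centroids, num_parts):
    """Index of the nearest centroid, computed from per-slice partial sums.

    Assumption: num_parts >= 1; slices are processed one after another here,
    whereas a parallel version would run them concurrently and then reduce.
    """
    dims = len(point)
    step = -(-dims // num_parts)  # ceiling division
    totals = [0.0] * len(centroids)
    for lo in range(0, dims, step):
        part = partial_sq_dists(point, centroids, lo, min(lo + step, dims))
        totals = [t + p for t, p in zip(totals, part)]  # reduction step
    return min(range(len(centroids)), key=totals.__getitem__)


if __name__ == "__main__":
    centroids = [(0, 0, 0, 0, 0), (1, 2, 3, 4, 6), (5, 5, 5, 5, 5)]
    print(assign_by_dimension((1, 2, 3, 4, 5), centroids, num_parts=2))
```

Because the result does not depend on how the dimensions are sliced, the number of parts can be chosen to fit whatever memory each worker has available.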